{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M6.02\n",
    "\n",
    "The aim of this exercise it to explore some attributes available in\n",
    "scikit-learn's random forest.\n",
    "\n",
    "First, we will fit the penguins regression dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_regression.csv\")\n",
    "feature_name = \"Flipper Length (mm)\"\n",
    "target_name = \"Body Mass (g)\"\n",
    "data, target = penguins[[feature_name]], penguins[target_name]\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a random forest containing three trees. Train the forest and check the\n",
    "generalization performance on the testing set in terms of mean absolute error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "\n",
    "forest = RandomForestRegressor(n_estimators=3)\n",
    "forest.fit(data_train, target_train)\n",
    "target_predicted = forest.predict(data_test)\n",
    "print(\n",
    "    \"Mean absolute error: \"\n",
    "    f\"{mean_absolute_error(target_test, target_predicted):.3f} grams\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now aim to plot the predictions from the individual trees in the forest.\n",
    "For that purpose you have to create first a new dataset containing evenly\n",
    "spaced values for the flipper length over the interval between 170 mm and 230\n",
    "mm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import numpy as np\n",
    "\n",
    "data_range = pd.DataFrame(np.linspace(170, 235, num=300), columns=data.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The trees contained in the forest that you created can be accessed with the\n",
    "attribute `estimators_`. Use them to predict the body mass corresponding to\n",
    "the values in this newly created dataset. Similarly find the predictions of\n",
    "the random forest in this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "tree_predictions = []\n",
    "\n",
    "for tree in forest.estimators_:\n",
    "    # we convert `data_range` into a NumPy array to avoid a warning raised in scikit-learn\n",
    "    tree_predictions.append(tree.predict(data_range.to_numpy()))\n",
    "\n",
    "forest_predictions = forest.predict(data_range)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now make a plot that displays:\n",
    "- the whole `data` using a scatter plot;\n",
    "- the decision of each individual tree;\n",
    "- the decision of the random forest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.scatterplot(\n",
    "    data=penguins, x=feature_name, y=target_name, color=\"black\", alpha=0.5\n",
    ")\n",
    "\n",
    "# plot tree predictions\n",
    "for tree_idx, predictions in enumerate(tree_predictions):\n",
    "    plt.plot(\n",
    "        data_range[feature_name],\n",
    "        predictions,\n",
    "        label=f\"Tree #{tree_idx}\",\n",
    "        linestyle=\"--\",\n",
    "        alpha=0.8,\n",
    "    )\n",
    "\n",
    "plt.plot(data_range[feature_name], forest_predictions, label=\"Random forest\")\n",
    "_ = plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "The random forest reduces the overfitting of the individual trees but still\n",
    "overfits itself. In the section on \"hyperparameter tuning with ensemble\n",
    "methods\" we will see how to further mitigate this effect. Still, interested\n",
    "users may increase the number of estimators in the forest and try different\n",
    "values of, e.g., `min_samples_split`."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}